Abstract

Hierarchical  reinforcement  learning  (HRL)  is  a  general  framework  that  studies  how  to 
exploit  the  structure  of  actions  and  tasks  to  accelerate  policy  learning  in  large  domains. 
Prior  work  on  HRL  has  been  limited  to  the  discrete-time  discounted  reward  semi-Markov 
decision process (SMDP) model. In this paper we generalize the setting of HRL to average-reward,
continuous-time and multi-agent SMDP models. We also describe experimental
results  from  a  large-scale  real-world  domain,  attesting  to  the  benefits  of  HRL  generally, 
and  to  our  extensions  more  specifically. 

Although  in  principle  any  HRL  framework  could  suffice,  we  focus  in  this  paper  on 
the MAXQ framework. We describe three new hierarchical reinforcement learning algorithms:
continuous-time discounted reward MAXQ, discrete-time average reward MAXQ,
and  continuous-time  average  reward  MAXQ.  We  also  investigate  the  use  of  hierarchical 
reinforcement  learning  to  speed  up  the  acquisition  of  cooperative  multiagent  tasks.  We 
extend  the  MAXQ  framework  to  the  multiagent  case  which  we  term  cooperative  MAXQ, 
where  each  agent  uses  the  same  task  hierarchy.  Learning  is  decentralized,  with  each  agent 
learning  three  interrelated  skills:  how  to  perform  subtasks,  which  order  to  do  them  in,  and 
how  to  coordinate  with  other  agents.  Coordination  skills  among  agents  are  learned  by  using 
joint actions at the highest level(s) of the hierarchy.

We  use  two  experimental  testbeds  to  study  the  empirical  performance  of  our  proposed 
extensions.  One  domain  is  a  simulated  robot  trash  collection  task.  The  other  domain  is 
a  much  larger  real-world  multi-agent  autonomous  guided  vehicle  (MAGV)  problem.  We 
compare  the  performance  of  our  proposed  algorithms  with  each  other,  as  well  as  with  the 
original MAXQ method and with standard Q-learning. In the MAGV domain, we show that
our proposed extensions outperform widely used industrial heuristics, such as “first come
first serve”, “highest queue first” and “nearest station first”.

Keywords: Hierarchical Reinforcement Learning, Multiagent Systems, Semi-Markov Decision Processes, Average-Reward.

1. Introduction

Reinforcement  learning  (RL)  (Sutton  and  Barto,  1998)  is  a  class  of  problems  whereby  agents 
must  learn  what  actions  to  take  in  different  situations  or  states  in  order  to  maximize  a  scalar 
feedback function (or reward) over time. The mapping from states to actions is referred
to  as  a  policy  or  a  closed-loop  plan.  Learning  occurs  based  on  the  idea  that  the  tendency 
to  produce  an  action  should  be  reinforced  if  it  produces  favorable  long-term  rewards,  and 
weakened  otherwise.  From  the  perspective  of  control  theory,  RL  algorithms  can  be  shown 
to be approximations to classical approaches to solving optimal control problems. The classical
approaches use dynamic programming (DP) (Bertsekas, 1995), which requires perfect
knowledge of the system dynamics and payoff function. Reinforcement learning has the
advantage of potentially being able to find optimal (or close-to-optimal) solutions
in domains where models are unknown or unavailable.

Instead  of  learning  the  policy  directly,  most  approaches  in  RL  learn  an  indirect  target 
function,  referred  to  as  the  value  function  as  it  represents  the  long-term  payoff  associated 
with  states  or  state-action  pairs.  The  policy  can  be  recovered  from  a  value  function  by 
choosing “greedy” actions that maximize the value of nearby states (or of the immediate
state-action pairs). Although it is possible to show theoretically that RL algorithms, such as
Q-learning (Watkins, 1989) or TD(λ) (Sutton, 1988), converge asymptotically to produce
the  optimal  value  function  (one  that  dominates  all  other  value  functions  over  the  state 
space),  convergence  is  only  assured  in  restricted  cases.  Often,  real-world  problems  require 
using  nonlinear  function  approximators  for  which  convergence  is  not  guaranteed.  Moreover, 
even if asymptotic convergence were theoretically guaranteed by using a restricted class
of parametric approximators, in practice one is often interested in convergence within a
reasonable time bound. Convergence in real time is usually extremely slow when algorithms
such as Q-learning are combined with nonlinear function approximators, such
as  neural  nets.  For  example,  in  the  well-known  multiagent  elevator  task  (Crites  and  Barto, 
1998),  convergence  takes  on  the  order  of  millions  of  steps  of  simulated  time. 
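
To make the relationship between a value function and its greedy policy concrete, the following minimal Python sketch shows one tabular Q-learning backup and the greedy action selection it supports. The two-state toy problem, the step size `alpha`, the discount factor `gamma`, and the function names are illustrative assumptions, not part of the algorithms proposed in this paper.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup for a unit-time step."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def greedy_action(Q, s, actions):
    """Recover the policy from the value function by acting greedily."""
    return max(actions, key=lambda b: Q[(s, b)])

# Hypothetical toy problem: in state 0, "right" earns reward 1 and "left" earns 0;
# both actions lead to state 1, whose action values are never updated and stay 0.
actions = ["left", "right"]
Q = defaultdict(float)
for _ in range(200):
    q_update(Q, 0, "right", 1.0, 1, actions)
    q_update(Q, 0, "left", 0.0, 1, actions)

policy_at_0 = greedy_action(Q, 0, actions)
```

Even in this tiny example the estimate approaches its fixed point only geometrically in the number of backups, which hints at why convergence becomes slow in large state spaces.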

The central focus of this paper is to present new algorithms for reinforcement learning,
applicable to continuous-time multiagent tasks such as the elevator problem, with
which convergence occurs much more rapidly than with traditional Q-learning. The new
algorithms  are  based  on  extending  hierarchical  reinforcement  learning  (HRL),  a  general 
framework  for  scaling  reinforcement  learning  to  problems  with  large  state  spaces  by  using 
the task (or action) structure to restrict the space of policies. The key principle underlying
HRL is to develop learning algorithms that do not need to learn policies from scratch,
but  instead  reuse  existing  policies  for  simpler  sub-tasks  (or  macro  actions).  The  difficulty 
with  using  the  traditional  framework  for  reusing  learned  policies  is  that  decision  making  no 
longer  occurs  in  synchronous  unit-time  steps,  as  is  traditionally  assumed  in  RL.  Instead, 
decision-making  occurs  in  epochs  of  variable  length,  such  as  when  a  distinguishing  state  is 
reached  (e.g.,  an  intersection  in  a  robot  navigation  task),  or  a  subtask  is  completed  (e.g., 
the  elevator  arrives  on  the  first  floor).  Furthermore,  these  variable  length  intervals  cause 
the  system  dynamics  to  be  non-Markov  with  respect  to  immediate  state  transitions,  since 
the  state  resulting  from  a  macro  action  might  depend  on  all  states  that  occurred  since  that 
macro  was  initiated.  For  example,  a  macro  action  undertaken  by  a  robot  for  cleaning  a 
room by sweeping the floor twice and then exiting will require remembering when the robot
started (you cannot predict its next state merely by observing that the robot was near the
door).

Fortunately, a well-known statistical model is available to treat variable length state dynamics:
the semi-Markov decision process (SMDP) model. Here, state transition dynamics
are specified not only by the state where an action was taken, but also by parameters specifying
the length of time since the action was taken. Work in HRL to date, including hierarchical
abstract machines (HAMs) (Parr, 1998), options (Sutton et al., 1999) and MAXQ (Dietterich,
2000), has used a discrete-time discounted SMDP framework. However, a wide class
of tasks are continuous-time in nature. Therefore, these frameworks, which are limited to
the  discrete-time  SMDP  model,  should  be  generalized  to  continuous-time  SMDPs  and  also 
to  the  average-reward  framework.  A  primary  goal  of  most  of  these  tasks,  such  as  AGV 
scheduling,  queuing  and  inventory  control,  is  to  find  a  gain  optimal  policy  that  maximizes 
(minimizes)  the  long-run  average  reward  (cost).  Although  average  reward  RL  has  been 
extensively  studied,  using  both  the  discrete-time  MDP  model  (Schwartz,  1993,  Mahadevan, 
1996,  Tadepalli  and  Ok,  1996a)  as  well  as  the  continuous-time  SMDP  model  (Mahadevan 
et  al.,  1997,  Wang  and  Mahadevan,  1999),  prior  work  has  been  limited  to  “flat”  algorithms. 
Therefore,  it  is  necessary  to  extend  hierarchical  reinforcement  learning  methods  such  as 
MAXQ, which have been limited to the discounted reward SMDP model, to the average reward
SMDP  framework. 

One  of  the  principal  contributions  of  this  paper  is  to  extend  HRL  to  more  interesting 
(and practical) settings, including continuous-time SMDP models, average-reward optimality
models, and also to multi-agent domains. These extensions make HRL more widely
applicable to many interesting practical problems, including the elevator domain, Robosoccer
(Stone and Veloso, 1999), transfer line production control (Wang and Mahadevan, 1999),
and multi-agent autonomous guided vehicle (MAGV) scheduling.

We focus our extension of HRL on the MAXQ framework, although in principle our
approach can also be applied to HAMs and options. We describe three new hierarchical reinforcement
learning algorithms: continuous-time discounted reward MAXQ, discrete-time
average reward MAXQ, and continuous-time average reward MAXQ. We also extend the
MAXQ framework to the multiagent case, which we term cooperative MAXQ, where each
agent uses the same task hierarchy to cooperatively learn complex tasks. Learning is decentralized,
with each agent learning three interrelated skills: how to perform subtasks, which
order  to  do  them  in,  and  how  to  coordinate  with  other  agents.  Coordination  skills  among 
agents  are  learned  by  using  joint  actions  at  the  highest  level(s)  of  the  hierarchy. 

We  use  two  experimental  testbeds  to  study  the  empirical  performance  of  our  proposed 
extensions.  One  domain  is  a  simulated  robot  trash  collection  domain.  The  other  domain 
is  a  much  larger  real-world  multi-agent  autonomous  guided  vehicle  (MAGV)  domain.  We 
compare  the  performance  of  our  proposed  algorithms  with  each  other,  as  well  as  with  the 
original MAXQ method and with standard Q-learning. In the MAGV domain, we show that
our proposed extensions outperform widely used industrial heuristics, such as “first come
first serve”, “highest queue first” and “nearest station first”.

The rest of this paper is organized as follows. Section 2 provides an overview of multiagent
reinforcement learning. Section 3 introduces the continuous-time SMDP framework
under both discounted and average reward paradigms. Section 4 describes the original
discrete-time discounted MAXQ framework, and illustrates it using a multiagent robot
trash collection task. Sections 5, 6, and 7 describe the continuous-time
discounted reward MAXQ, discrete-time average reward MAXQ, and continuous-time average
reward MAXQ algorithms, respectively. Section 8 describes the multiagent cooperative
MAXQ algorithm. Section 9 describes the real-world multiagent testbed of AGV scheduling.
Section 10 presents experimental results of using the proposed algorithms in the trash collection
task and the multiagent AGV scheduling problem. Finally, Section 11 summarizes the
paper  and  discusses  some  directions  for  future  work. 


2.  Hierarchical  Multiagent  Reinforcement  Learning 

Consider  sending  a  team  of  robots  to  carry  out  reconnaissance  of  an  indoor  environment 
to  check  for  intruders.  This  problem  is  naturally  viewed  as  a  multiagent  task  (Weiss, 
1999).  The  most  effective  strategy  will  require  coordination  among  the  individual  robots. 
A natural decomposition of this task would be to assign different parts of the environment,
for  example  rooms,  to  different  robots.  In  this  paper,  we  are  interested  in  learning  algorithms 
for  such  cooperative  multiagent  tasks,  where  the  agents  learn  the  coordination  skills  by  trial 
and  error.  The  key  idea  underlying  our  approach  is  simply  that  coordination  skills  are 
learned  much  more  efficiently  if  the  robots  have  a  hierarchical  representation  of  the  task 
structure (algorithms for learning task-level coordination have been developed in non-MDP
approaches; see Sugawara and Lesser, 1998). In particular, rather than each robot learning
its  response  to  low-level  primitive  actions  of  the  other  robots  (for  instance  if  robot-1  goes 
forward,  what  should  robot-2  do),  it  learns  high-level  coordination  knowledge  (what  is  the 
utility  of  robot-2  searching  room-2  if  robot-1  is  searching  room-1,  and  so  on). 

Multiagent  reinforcement  learning  has  been  recognized  to  be  very  challenging,  since  the 
number of parameters to be learned increases dramatically with the number of agents. In addition,
when agents carry out actions in parallel, the environment is usually non-stationary
and often non-Markovian as well (Mataric, 1997). We do not address the non-stationary aspect
of multiagent learning in this paper. One approach that has been successful in the past
is  to  have  agents  learn  policies  that  are  parameterized  by  the  modes  of  interaction  (Wang 
and Mahadevan, 1999). Prior work in multiagent reinforcement learning can be divided
into work on competitive models and work on cooperative models. Littman (1994) and
Hu and Wellman (1998), among others, have studied the framework of
Markov games for competitive multiagent learning.

Here,  we  are  primarily  interested  in  the  cooperative  case.  The  work  on  cooperative 
learning  can  be  further  separated  based  on  the  extent  to  which  agents  need  to  communicate 
with  each  other.  For  example,  in  completely  accessible  models  such  as  multiagent  Markov 
decision  processes  (MMDP)  (Boutilier,  1999),  agents  have  complete  access  to  both  the 
global state and the global joint action. Tan (1993) and Xuan et al. (2001)
also exemplify the approach of requiring communication of global states and/or actions at
every step. Tan extends flat Q-learning to multiagent learning by using joint state-action
values, which requires communication of states and actions at every step. Xuan et al.
assume that agents have to communicate to obtain other agents’ local state information. In the
elevator  domain  (Crites  and  Barto,  1998),  agents  share  a  common  state  description  and  a 
global reinforcement signal, but do not model joint actions.

There are also studies of multiagent learning which do not model joint states or actions
explicitly, such as Balch and Arkin (1998) and Mataric (1997), among
others. In such systems, each robot maintains its position in the formation depending on
the locations of the other robots, so there is some implicit communication or sensing of
states and actions of other agents. There has also been work on reducing the parameters
needed for Q-learning in multiagent domains, by learning action values over a set of derived
features (Stone and Veloso, 1999). These derived features are domain-specific, and have to
be  encoded  by  hand,  or  constructed  by  a  supervised  learning  algorithm. 

Our  approach  differs  from  all  the  above  in  one  key  respect,  namely  the  use  of  HRL  to 
speed  up  cooperative  multiagent  reinforcement  learning.  We  assume  each  agent  is  given 
an  initial  hierarchical  decomposition  of  the  overall  task  (as  described  below,  we  adopt  the 
MAXQ  framework).  However,  the  learning  is  distributed  since  each  agent  has  only  a  local 
view of the overall state space. Furthermore, each agent learns joint abstract action-values
by communicating to the other agents only the high-level subtask that it is executing. Since
high-level  tasks  can  take  a  long  time  to  complete,  communication  is  needed  only  fairly 
infrequently  (this  is  another  significant  advantage  over  flat  methods). 

A  further  advantage  of  the  use  of  hierarchy  in  multiagent  learning  is  that  it  makes  it 
possible  to  learn  coordination  skills  at  the  level  of  abstract  actions.  The  agents  learn  joint 
action  values  only  at  the  highest  level(s)  of  abstraction  in  the  proposed  framework.  This 
allows  for  increased  cooperation  skills  as  agents  do  not  get  confused  by  low  level  details. 
In addition, each agent has only local state information, and is ignorant of the other
agents’ locations. This is based on the idea that in many cases, an agent can get a rough
idea  of  what  state  the  other  agent  might  be  in  just  by  knowing  about  the  high  level  action 
being  performed  by  the  other  agent.  Also,  keeping  track  of  just  this  information  greatly 
simplifies  the  underlying  reinforcement  learning  problem. 
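
One way to picture the joint abstract action-values described above is a table indexed by the agent's local state, the high-level subtasks announced by the other agents, and the agent's own candidate subtask. The sketch below is only an illustration under assumed names (`JointQTable`, `choose_subtask`, the subtask labels, the learning rate); it is not the cooperative MAXQ algorithm itself.

```python
from collections import defaultdict

class JointQTable:
    """Top-level values keyed by (local state, others' subtasks, own subtask)."""

    def __init__(self, subtasks):
        self.subtasks = list(subtasks)
        self.q = defaultdict(float)

    def value(self, state, others, own):
        return self.q[(state, tuple(others), own)]

    def update(self, state, others, own, target, alpha=0.5):
        key = (state, tuple(others), own)
        self.q[key] += alpha * (target - self.q[key])

    def choose_subtask(self, state, others):
        # Greedy choice given only the high-level subtasks that the other
        # agents have communicated; their low-level states are never used.
        return max(self.subtasks, key=lambda own: self.value(state, others, own))

# Hypothetical usage: agent 1 learns that duplicating agent 2's subtask pays poorly.
table = JointQTable(["NavT1", "NavT2"])
for _ in range(20):
    table.update("start", ["NavT1"], "NavT1", target=-1.0)
    table.update("start", ["NavT1"], "NavT2", target=1.0)
choice = table.choose_subtask("start", ["NavT1"])
```

Because the key contains subtask labels rather than the other agents' full states, the table grows with the number of high-level subtasks instead of the size of the joint state space.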

These benefits can potentially accrue from using any type of hierarchical learning algorithm,
though in this paper we only describe results using the MAXQ framework. The
reason that we decided to adopt the MAXQ framework as a basis for our multiagent algorithm
is the fact that the MAXQ method stores the value function in a distributed way in
all  nodes  in  the  subtask  graph.  The  value  function  is  propagated  upwards  from  the  lower 
level  nodes  whenever  a  high  level  node  needs  to  be  evaluated.  This  propagation  enables  the 
agent  to  simultaneously  learn  subtasks  and  high  level  tasks.  Thus,  by  using  this  method, 
agents  learn  the  coordination  skills  and  the  individual  low  level  tasks  and  subtasks  all  at 
once. 

However, it is necessary to generalize the MAXQ framework to make it more applicable
to multiagent learning. A broad class of multiagent optimization tasks, such as AGV
scheduling, can be viewed as discrete-event dynamic systems. For such tasks, the termination
predicate used in MAXQ has to be redefined to take care of the fact that the completion
of  certain  subtasks  might  depend  on  the  occurrence  of  an  event  rather  than  just  a  state  of 
the  environment. 

3.  Semi-Markov  Decision  Processes 

We  begin  with  a  review  of  continuous-time  semi-Markov  decision  processes.  Semi-Markov 
decision processes (SMDPs) are useful in modeling temporally extended actions. They
extend the discrete-time MDP model in several aspects. Time is modeled as a continuous
entity  and  decisions  are  only  made  at  discrete  points  in  time  (or  events).  The  state  of  the 
system  may  change  continually  between  decisions,  unlike  MDPs  where  state  changes  are 
only  due  to  actions. 

An SMDP is defined as a five tuple (S, A, P, R, F), where S is a finite set of states, A
is the set of actions, P is a set of state and action dependent transition probabilities, R
is the reward function, and F is a function giving the probability of transition times for each
state-action pair. $P(s' \mid s, a)$ denotes the probability that action a will cause the system to
transition from state s to state s'. This transition occurs at decision epochs only. Basically, the
SMDP represents snapshots of the system at decision points, whereas the so-called natural
process describes the evolution of the system over all times. $F(t \mid s, a)$ is the probability that
the next decision epoch occurs within t time units after the agent chooses action a in state
s at a decision epoch. From F and P, we can compute $\Phi$ by

$$\Phi(t, s' \mid s, a) = P(s' \mid s, a)\, F(t \mid s, a)$$

where $\Phi$ denotes the probability that the system will be in state s' for the next decision
epoch, at or before t time units after choosing action a in state s, at the last decision epoch.
In discrete-time semi-Markov decision processes, $P(s', N \mid s, a)$ denotes the probability that
the system will be in state s' for the next decision epoch, at or before N time steps after
choosing action a in state s, at the last decision epoch. The reward function for continuous-time
SMDPs is more complex than in the MDP model. In addition to the fixed reward of
taking action a in state s, $k(s, a)$, an additional reward may be accumulated at rate $c(s', s, a)$
for the time the natural process remains in state s' between decision epochs. Formally, the
expected reward between two decision epochs, given that the system is in state s and chooses
action a in the first decision epoch, is expressed as

$$r(s, a) = k(s, a) + E_s^a\left\{ \int_0^{\tau} c(W_t, s, a)\, dt \right\}$$

where $\tau$ is the transition time to the second decision epoch and $W_t$ denotes the state of the
natural process during this transition.
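
The two definitions above can be checked numerically in a special case. The following sketch assumes, purely for illustration, that transition times are exponentially distributed with a known rate and that the reward rate c is a constant that does not depend on the state of the natural process; under those assumptions the expected reward reduces to k + c · E{τ}. The function names and numbers are hypothetical.

```python
import math

def F(t, rate):
    """Assumed exponential transition-time distribution: F(t|s,a) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)

def phi(t, s_next, P_sa, rate):
    """Phi(t, s'|s,a) = P(s'|s,a) * F(t|s,a) for one fixed state-action pair."""
    return P_sa[s_next] * F(t, rate)

def expected_reward(k, c, rate):
    """r(s,a) = k + E{integral of c dt} = k + c * E{tau}, where E{tau} = 1 / rate.

    Valid only under the assumption that the reward rate c is constant.
    """
    return k + c / rate

# Hypothetical numbers for a single state-action pair (s, a).
P_sa = {"s1": 0.25, "s2": 0.75}   # P(s'|s,a)
rate = 2.0                         # mean transition time is 0.5 time units
phi_s1 = phi(1.0, "s1", P_sa, rate)
r_sa = expected_reward(k=1.0, c=4.0, rate=rate)
```

As t grows, the values of `phi` summed over the successor states approach 1, reflecting that some transition eventually occurs.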

3.1  Discounted  Models 

We  now  give  a  short  overview  of  infinite-horizon  discounted  semi-Markov  decision  processes 
(Puterman,  1994,  Bradtke  and  Duff,  1995).  We  assume  continuous-time  discounting  at  rate 
$\beta > 0$, which means that the present value of one reward unit received t time units in the
future equals $e^{-\beta t}$. In this model, for policy $\pi$, $v^\pi(s)$ denotes the expected infinite-horizon
discounted reward, given that the process occupies state s at the first decision epoch and
is defined by

$$v^\pi(s) = E_s^\pi\left\{ \sum_{n=0}^{\infty} e^{-\beta \sigma_n} \left[ k(s_n, a_n) + \int_{\sigma_n}^{\sigma_{n+1}} e^{-\beta (t - \sigma_n)}\, c(W_t, s_n, a_n)\, dt \right] \right\} \qquad (1)$$

In the above expression, $\sigma_0, \sigma_1, \ldots$ represent the times of successive decision epochs and
$e^{-\beta \sigma_n}$ transforms the reward to values at the first decision epoch. In this model, the expected
discounted reward between two decision epochs is defined as

$$r(s, a) = k(s, a) + \int_0^{\infty} \sum_{s' \in S} \left[ \int_0^{u} e^{-\beta t}\, c(s', s, a)\, P(s' \mid s, a)\, dt \right] F(du \mid s, a) \qquad (2)$$


Using  Equation  (2),  we  can  re-express  the  value  function  in  Equation  (1)  as 

$$v^\pi(s) = r(s, \pi(s)) + \sum_{s' \in S} \int_0^{\infty} e^{-\beta t}\, v^\pi(s')\, \Phi(dt, s' \mid s, \pi(s))$$

The action value function $Q^\pi(s, a)$ represents the discounted cumulative reward of doing an
action a in state s once, and then following policy $\pi$ subsequently.

$$Q^\pi(s, a) = r(s, a) + \sum_{s' \in S} P(s' \mid s, a) \int_0^{\infty} e^{-\beta t}\, Q^\pi(s', \pi(s'))\, F(dt \mid s, a)$$
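
A sample-based analogue of the action value equation above replaces the expectations with a single observed transition of duration τ. The sketch below is a minimal illustration, not one of the algorithms of this paper: it assumes a lump reward k received at the decision epoch and a constant reward rate c during the transition, so that the discounted accrued reward has the closed form c(1 − e^{−βτ})/β, and it backs up the greedy value of the next state rather than the value under a fixed policy. All names and numbers are hypothetical.

```python
import math
from collections import defaultdict

def smdp_q_update(Q, s, a, k, c, tau, s_next, actions, beta=0.1, alpha=0.5):
    """One sampled backup of Q(s,a) for a transition that lasted tau time units."""
    discount = math.exp(-beta * tau)
    # Reward accrued at constant rate c over [0, tau], discounted continuously.
    accrued = c * (1.0 - discount) / beta
    best_next = max(Q[(s_next, b)] for b in actions)
    target = k + accrued + discount * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target

actions = ["a0", "a1"]
Q = defaultdict(float)
Q[("s1", "a0")] = 10.0
target = smdp_q_update(Q, "s0", "a0", k=1.0, c=2.0, tau=3.0,
                       s_next="s1", actions=actions)
```

Note that the discount applied to the successor value depends on the sampled duration, which is the main difference from the unit-time Q-learning backup.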


3.2  Average  Reward  Models 

The theory of infinite-horizon semi-Markov decision processes with the average reward criterion
is more complex than that for discounted models (Puterman, 1994, Mahadevan, 1996).
To  simplify  exposition  we  assume  that  for  every  stationary  policy,  the  embedded  Markov 
chain  has  a  unichain  transition  probability  matrix.  Under  this  assumption,  the  expected 
average  reward  of  every  stationary  policy  does  not  vary  with  the  initial  state.  In  this  section 
we  illustrate  both  discrete-time  and  continuous-time  average  reward  semi-Markov  decision 
processes. 

3.2.1  Discrete-Time  Average  Reward  Models 

For policy $\pi$, state $s \in S$ and number of time steps $N > 0$, $V_N^\pi(s)$ denotes the expected total
reward generated by the policy $\pi$ up to time step N, given that the system occupies state
s at time 0 and is defined as

$$V_N^\pi(s) = E_s^\pi\left\{ \sum_{u=0}^{N-1} r(s_u, a_u) \right\}$$

The average expected reward or gain $g^\pi(s)$ for a policy $\pi$ at state s can be defined by taking
the limit inferior of the ratio of the expected total reward up until the Nth decision epoch
to the number of decision epochs. So, the gain of a policy $g^\pi(s)$ can be expressed as the
ratio

$$g^\pi(s) = \lim_{N \to \infty} \frac{E_s^\pi\left\{ \sum_{u=0}^{N-1} r(s_u, a_u) \right\}}{N}$$

For unichain MDPs, the gain of any policy is state independent and we can write $g^\pi(s) = g^\pi$.
For each transition, the expected number of transition steps is defined as:

$$y(s, a) = E_s^a\{N\} = \sum_{N=0}^{\infty} N \sum_{s' \in S} P(s', N \mid s, a)$$

In discrete-time unichain average reward SMDPs, the expected average adjusted sum of
rewards $h^\pi$ for stationary policy $\pi$ is defined as

$$h^\pi(s) = E_s^\pi\left\{ \sum_{u=0}^{\infty} \left[ r(s_u, a_u) - g^\pi(s_u) \right] \right\} = E_s^\pi\left\{ \sum_{u=0}^{\infty} \left[ r(s_u, a_u) - g^\pi \right] \right\} \qquad (3)$$

The Bellman equation for discrete-time unichain average reward SMDPs is defined based
on the h function in Equation (3) and can be written as

$$h^\pi(s) = r(s, \pi(s)) - g^\pi\, y(s, \pi(s)) + \sum_{s' \in S,\, N} P(s', N \mid s, \pi(s))\, h^\pi(s')$$

The action value function $R^\pi(s, a)$ represents the average adjusted value of doing an action
a in state s once, and then following policy $\pi$ subsequently.

$$R^\pi(s, a) = r(s, a) - g^\pi\, y(s, a) + \sum_{s' \in S,\, N} P(s', N \mid s, a)\, R^\pi(s', \pi(s'))$$
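
For a fixed policy on a small problem, the gain and the average adjusted values can be computed directly from the Bellman equation above. The sketch below is an illustration under stated assumptions: `P[s][t]` is the embedded transition probability obtained by summing P(s', N | s, π(s)) over N, the embedded chain is unichain and aperiodic, the gain is obtained as expected reward per transition divided by expected steps per transition under the stationary distribution of the embedded chain, and h is normalized so that h[0] = 0. The example numbers are hypothetical.

```python
def stationary_distribution(P, iters=500):
    """Stationary distribution of the embedded chain by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu

def evaluate_policy(P, r, y, iters=500):
    """Return (gain, h) for a fixed policy, with h normalized so that h[0] = 0."""
    n = len(P)
    mu = stationary_distribution(P)
    # Gain = expected reward per transition / expected steps per transition.
    gain = sum(m * ri for m, ri in zip(mu, r)) / sum(m * yi for m, yi in zip(mu, y))
    h = [0.0] * n
    for _ in range(iters):
        # Bellman backup: h(s) = r(s) - g * y(s) + sum over s' of P(s'|s) * h(s')
        h = [r[s] - gain * y[s] + sum(P[s][t] * h[t] for t in range(n))
             for s in range(n)]
    return gain, [v - h[0] for v in h]

# Hypothetical two-state example under a fixed policy.
P = [[0.9, 0.1], [0.5, 0.5]]   # embedded transition probabilities
r = [2.0, 4.0]                 # expected reward per transition, r(s, pi(s))
y = [1.0, 3.0]                 # expected number of steps, y(s, pi(s))
gain, h = evaluate_policy(P, r, y)
```

In this example the stationary distribution is (5/6, 1/6), so the gain is (14/6)/(8/6) = 1.75, and substituting into the Bellman equation with h[0] = 0 gives h[1] = −2.5.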


3.2.2  Continuous-Time  Average  Reward  Models 


For policy $\pi$, state $s \in S$ and time $t > 0$, $V_t^\pi(s)$ denotes the expected total reward generated
by the process up to time t, given that the system occupies state s at time 0 and is defined
as

$$V_t^\pi(s) = E_s^\pi\left\{ \sum_{n=0}^{\nu_t - 1} k(s_n, a_n) + \int_0^{t} c(W_u, s_{\nu_u}, a_{\nu_u})\, du \right\}$$

where $\nu_t$ is the number of decisions made up to time t. In this model, the expected total
reward between two decision epochs is defined as

$$r(s, a) = k(s, a) + \int_0^{\infty} \sum_{s' \in S} \left[ \int_0^{u} c(s', s, a)\, P(s' \mid s, a)\, dt \right] F(du \mid s, a)$$

The average expected reward or gain $g^\pi(s)$ for a policy $\pi$ at state s can be defined by taking
the limit inferior of the ratio of the expected total reward up until the nth decision epoch
to the expected total time until the nth decision epoch. So, the gain of a policy $g^\pi(s)$ can
be expressed as the ratio

$$g^\pi(s) = \lim_{n \to \infty} \frac{E_s^\pi\left\{ \sum_{i=0}^{n} \left[ k(s_i, a_i) + \int_{\sigma_i}^{\sigma_{i+1}} c(W_t, s_i, a_i)\, dt \right] \right\}}{E_s^\pi\left\{ \sum_{i=0}^{n} \tau_i \right\}}$$

For unichain MDPs, the gain of any policy is state independent and we can write $g^\pi(s) = g^\pi$.
For each transition, the expected transition time is defined as:

$$y(s, a) = E_s^a\{\tau\} = \int_0^{\infty} t \sum_{s' \in S} \Phi(dt, s' \mid s, a)$$

In continuous-time unichain average reward SMDPs, the expected average adjusted sum of
rewards $h^\pi$ for stationary policy $\pi$ is defined as

$$h^\pi(s) = V_t^\pi(s) - g^\pi t \qquad (4)$$

where t is the time at which the decision epoch occurs. The Bellman equation for unichain
average reward SMDPs is defined based on the h function in Equation (4) and can be
written as

$$h^\pi(s) = r(s, \pi(s)) - g^\pi\, y(s, \pi(s)) + \sum_{s' \in S} P(s' \mid s, \pi(s)) \int_0^{\infty} h^\pi(s')\, F(dt \mid s, \pi(s))$$

The action value function $R^\pi(s, a)$ represents the average adjusted value of doing an action
a in state s once, and then following policy $\pi$ subsequently.

$$R^\pi(s, a) = r(s, a) - g^\pi\, y(s, a) + \sum_{s' \in S} P(s' \mid s, a) \int_0^{\infty} R^\pi(s', \pi(s'))\, F(dt \mid s, a)$$
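
The action value equation above also suggests a simple sample-based update in which the expected reward and expected transition time are replaced by the reward and duration observed on one transition. The following sketch is a hedged illustration, not one of the algorithms developed in this paper: it backs up the greedy value of the next state, and it estimates the gain as cumulative reward divided by cumulative elapsed time over all transitions, which is a simplifying assumption. The class name, step size, and numbers are hypothetical.

```python
from collections import defaultdict

class AverageRewardLearner:
    """Sampled version of R(s,a) = r(s,a) - g * y(s,a) + E[R(s', .)]."""

    def __init__(self, actions, alpha=0.5):
        self.actions = list(actions)
        self.alpha = alpha
        self.R = defaultdict(float)
        self.total_reward = 0.0
        self.total_time = 0.0
        self.gain = 0.0

    def update(self, s, a, reward, tau, s_next):
        # 'reward' is the total reward observed over a transition of length tau.
        best_next = max(self.R[(s_next, b)] for b in self.actions)
        target = reward - self.gain * tau + best_next
        self.R[(s, a)] += self.alpha * (target - self.R[(s, a)])
        # Estimate the gain as cumulative reward per unit of elapsed time.
        self.total_reward += reward
        self.total_time += tau
        self.gain = self.total_reward / self.total_time
        return target

learner = AverageRewardLearner(["a0", "a1"])
first = learner.update("s0", "a0", reward=6.0, tau=2.0, s_next="s1")
second = learner.update("s1", "a0", reward=3.0, tau=1.0, s_next="s0")
```

The term `self.gain * tau` plays the role of g·y(s, a): a transition is credited only for the reward it earns beyond what the current average rate would have produced over the same duration.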


4.  The  MAXQ  Framework 

The  reinforcement  learning  algorithms  introduced  in  this  paper  extend  the  MAXQ  value 
function  decomposition  developed  originally  in  the  context  of  the  discrete-time  SMDP  model 
(for  single  agents)  (Dietterich,  2000).  This  approach  involves  the  use  of  a  graph  to  store 
a  distributed  value  function.  The  overall  task  is  first  decomposed  into  subtasks  up  to  the 
desired  level  of  detail,  and  the  task  graph  is  constructed.  We  illustrate  the  idea  using  a 
simple two-robot trash collection task shown in Figure 1. Consider the case where a robot is assigned
the task of picking up trash from trash cans over an extended area and accumulating it into
one centralized trash bin, from where it might be sent for recycling or disposed of. This is a
task which can be parallelized if we have more than one agent working on it. An office-type
environment (rooms and connecting corridors) is shown in the figure. A1 and A2 represent
the two agents in the figure. Note the agents need to learn three skills here. First, how to
do each subtask, such as navigating to T1 or T2 or Dump, and when to perform the Pickup
or Putdown action. Second, the agents also need to learn the order in which to do subtasks (for
instance, go to T1 and collect trash before heading to the Dump). Finally, the agents also
need to learn how to coordinate with other agents (e.g., Agent1 can pick up trash from T1
whereas Agent2 can service T2). The strength of the MAXQ framework (when extended
to  the  multiagent  case)  is  that  it  can  serve  as  a  substrate  for  learning  all  these  three  types 
of  skills. 

This  trash  collection  task  can  be  decomposed  into  subtasks  and  the  resulting  task  graph 
is shown in Figure 2. The task graph is then converted to the MAXQ graph, which is shown
in Figure 3. The MAXQ graph has two types of nodes: MAX nodes (triangles) and Q nodes
(rectangles),  which  represent  the  different  actions  that  can  be  done  under  their  parents. 
Note  that  MAXQ  allows  learning  of  shared  subtasks.  For  example,  the  navigation  task  Nav 
is  common  to  several  parent  tasks. 

The  multiagent  learning  scenario  under  investigation  in  this  paper  can  now  be  illustrated. 
Imagine  the  two  robots  start  to  learn  this  task  with  the  same  MAXQ  graph  structure.  We 
can  distinguish  between  two  learning  approaches,  selfish  and  cooperative.  In  the  selfish  case, 
the  two  robots  learn  with  the  given  MAXQ  structure,  but  make  no  attempt  to  communicate 
with  each  other.  In  the  cooperative  case,  the  MAXQ  structure  is  modified  such  that  the  Q 
nodes  at  the  level(s)  immediately  under  the  root  task  include  the  joint  action  done  by  both 
robots. For instance, each robot learns the joint Q-value of navigating to trash T1 when
the other robot is either navigating to T1 or T2 or Dump or doing a Put or Pick action.
As we will show in a more complex domain below, cooperation among the agents results
in better learned performance than in the selfish case, or indeed the flat case when the
agents do not use a task hierarchy at all.

Figure 1: A (simulated) multiagent robot trash collection task. T1: location of one trash can. T2: location of another trash can. Dump: final destination location for depositing all trash.

Figure 2: The task graph for the trash collection task.

Figure 3: The MAXQ graph for the trash collection task. T1: location of trash 1. T2: location of trash 2. D: location of dump.

More formally, the MAXQ method decomposes an MDP $M$ into a set of subtasks
$M_0, M_1, \ldots, M_n$. Each subtask $M_i$ is a three-tuple $(T_i, A_i, \tilde{R}_i)$ defined as:

• $T_i(s_i)$ is a termination predicate which partitions the state space $S$ into a set of active
states $S_i$ and a set of terminal states $T_i$. The policy for subtask $M_i$ can only be
executed if the current state $s \in S_i$.

• $A_i$ is a set of actions that can be performed to achieve subtask $M_i$. These actions can
either be primitive actions from $A$, the set of primitive actions for the MDP, or they
can be other subtasks.

• $\tilde{R}_i(s' \mid s, a)$ is the pseudo-reward function, which specifies a pseudo-reward for each
transition from a state $s \in S_i$ to a terminal state $s' \in T_i$. This pseudo-reward tells
how desirable each of the terminal states is for this particular subtask.

Each primitive action $a$ is a primitive subtask in the MAXQ decomposition, such that
$a$ is always executable, it terminates immediately after execution, and its pseudo-reward
function is uniformly zero. The projected value function $V^{\pi}$ is the value of executing
hierarchical policy $\pi$ starting in state $s$ and at the root of the hierarchy. The outside
completion function $C(i, s, a)$ is the expected cumulative discounted reward of completing
subtask $M_i$ after invoking the subroutine for subtask $M_a$ in state $s$. It is computed
without any reference to $\tilde{R}_i$. This completion function will be used by parent tasks
to compute $V(i, s)$, the expected reward for performing action $i$ starting in state $s$.
The inside completion function $\tilde{C}(i, s, a)$ is a completion function that is used only
inside node $i$ in order to discover the locally optimal policy for task $M_i$. This function
incorporates rewards both from the real reward function, $R(s' \mid s, a)$, and from the
pseudo-reward function, $\tilde{R}_i(s')$.

The value function $V(i, s)$ for doing task $i$ in state $s$ is calculated by decomposing it
into two parts: the value of the subtask, which is independent of the parent task, and the
value of the completion of the task, which of course depends on the parent task.

$$V(i, s) = \begin{cases} \max_{a} Q(i, s, a) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive} \end{cases}$$

$$Q(i, s, a) = V(a, s) + C(i, s, a) \tag{5}$$

where  Q(i,  s,  a)  is  the  action  value  of  doing  subtask  a  in  state  s  in  the  context  of  parent 
task  i. 
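
The recursion between V and Q can be made concrete with a small sketch. The hierarchy, the state name `s0`, the table names `CHILDREN`, `V_PRIMITIVE` and `C`, and all numbers below are hypothetical and chosen only for illustration; the sketch assumes the values of primitive actions and the completion values have already been learned and are stored in lookup tables.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: each composite task maps to its child actions.
# Any task that does not appear as a key is treated as primitive.
CHILDREN = {
    "Root": ["NavT1", "Pick"],
    "NavT1": ["north", "east"],
}

V_PRIMITIVE = defaultdict(float)  # V(a, s) for primitive actions a
C = defaultdict(float)            # C(i, s, a): completion values


def value(i, s):
    """V(i, s): table lookup for a primitive, max over children otherwise."""
    if i not in CHILDREN:
        return V_PRIMITIVE[(i, s)]
    return max(q_value(i, s, a) for a in CHILDREN[i])


def q_value(i, s, a):
    """Q(i, s, a) = V(a, s) + C(i, s, a), as in Equation 5."""
    return value(a, s) + C[(i, s, a)]


# Illustrative numbers only (not taken from the experiments).
V_PRIMITIVE.update({
    ("north", "s0"): -1.0,
    ("east", "s0"): -2.0,
    ("Pick", "s0"): -5.0,
})
C.update({
    ("NavT1", "s0", "north"): -3.0,
    ("NavT1", "s0", "east"): -1.0,
    ("Root", "s0", "NavT1"): -10.0,
    ("Root", "s0", "Pick"): -20.0,
})

root_value = value("Root", "s0")
```

Note that the value of `NavT1` is computed once from its own children and then reused by `Root`; this is the sense in which a shared subtask is independent of its parent.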

The Q values and the C values can be learned through a standard temporal-difference
learning method, based on sample trajectories (see (Dietterich, 2000) for details). One
important point to note here is that since subtasks are temporally extended, the
update rules used here are based on the SMDP model.

Let us assume that an agent is in state $s$ while doing task $i$, and chooses subtask $j$ to
execute. Let this subtask terminate after $N$ steps and result in state $s'$. Then the SMDP
Q-learning rules used to update the outside and inside completion functions are given by

$$C_{t+1}(i, s, j) \leftarrow (1 - \alpha_t)\, C_t(i, s, j) + \alpha_t \gamma^N \left[ C_t(i, s', a^*) + V_t(a^*, s') \right]$$

$$\tilde{C}_{t+1}(i, s, j) \leftarrow (1 - \alpha_t)\, \tilde{C}_t(i, s, j) + \alpha_t \gamma^N \left[ \tilde{R}_i(s') + \tilde{C}_t(i, s', a^*) + V_t(a^*, s') \right]$$

where $a^* = \arg\max_{a'} \left[ \tilde{C}_t(i, s', a') + V_t(a', s') \right]$.
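
As a rough illustration of these two rules, the following sketch performs one update after subtask `j` has terminated. The function name `update_completion`, the dictionary-based tables, and the callable `V` are assumptions made for this example; they are not part of the original algorithm description.

```python
from collections import defaultdict


def update_completion(C_out, C_in, V, i, s, j, s_next, n_steps, actions,
                      pseudo_reward=0.0, alpha=0.1, gamma=0.9):
    """One SMDP update of the outside (C_out) and inside (C_in) completion
    values of parent i after child j ran for n_steps and ended in s_next.

    V is a callable V(a, s) returning the current value of child a in s.
    """
    # The greedy successor action is selected with the inside completion values.
    a_star = max(actions, key=lambda a: C_in[(i, s_next, a)] + V(a, s_next))
    discount = gamma ** n_steps

    target_out = discount * (C_out[(i, s_next, a_star)] + V(a_star, s_next))
    target_in = discount * (pseudo_reward + C_in[(i, s_next, a_star)]
                            + V(a_star, s_next))

    key = (i, s, j)
    C_out[key] = (1.0 - alpha) * C_out[key] + alpha * target_out
    C_in[key] = (1.0 - alpha) * C_in[key] + alpha * target_in
    return a_star


# Usage with made-up numbers.
C_out, C_in = defaultdict(float), defaultdict(float)
child_values = {"a1": 1.0, "a2": 2.0}
chosen = update_completion(C_out, C_in, lambda a, s: child_values[a],
                           "Root", "s0", "a1", "s1", n_steps=2,
                           actions=["a1", "a2"], pseudo_reward=4.0,
                           alpha=0.5, gamma=0.5)
```

The pseudo-reward enters only the inside table, so it can shape the local policy of task `i` without contaminating the values that parent tasks read from the outside table.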

A hierarchical policy $\pi$ is a set containing a policy for each of the subtasks in the problem:
$\pi = \{\pi_0, \ldots, \pi_n\}$. The projected value function in the hierarchical case, denoted by $V^{\pi}(s)$,
is the value of executing hierarchical policy $\pi$ starting in state $s$ and starting at the root
of the task hierarchy. A recursively optimal policy for MDP $M$ with MAXQ decomposition
$\{M_0, \ldots, M_n\}$ is a hierarchical policy $\pi = \{\pi_0, \ldots, \pi_n\}$ such that for each subtask $M_i$ the
corresponding policy $\pi_i$ is optimal for the SMDP defined by the set of states $S_i$, the set of
actions $A_i$, the state transition probability function $P^{\pi}(s', N \mid s, a)$, and the reward function
given by the sum of the original reward function $R(s' \mid s, a)$ and the pseudo-reward function
$\tilde{R}_i(s')$. The MAXQ learning algorithm has been proven to converge to $\pi^*$, the unique
recursively optimal policy for MDP $M$ and MAXQ graph $H$, where $M = (S, A, P, R, P_0)$
is a discounted infinite horizon MDP with discount factor $\gamma$, and $H$ is a MAXQ graph
defined over subtasks $\{M_0, \ldots, M_n\}$.

5.  Continuous-Time  Discounted  Reward  MAXQ  Algorithm 

At the center of the MAXQ method for hierarchical reinforcement learning is the MAXQ
value function decomposition. We show how the overall value function for a policy is
decomposed into a collection of value functions for individual subtasks for the
continuous-time discounted reward model. The projected value function of hierarchical
policy $\pi$ on subtask $M_i$, denoted $V^{\pi}(i, s)$, is the expected cumulative discounted
reward of executing $\pi_i$ (and the policies of all descendants of $M_i$) starting in state $s$
until $M_i$ terminates. The value $V^{\pi}(i, s)$ has the following form in the continuous-time
discounted reward framework:

$$V^{\pi}(i, s) = E_s^{\pi} \left\{ \sum_{n=0}^{\infty} e^{-\beta \sigma_n}\, r(s_n, a_n) \right\} \tag{6}$$

where $r(s_n, a_n)$ is defined using Equation 2. Now let us suppose that the first action chosen
by $\pi_i$ is invoked and executes for a number of steps $N$ and terminates in state $s'$ according
to $P_i^{\pi}(s' \mid s, a)$. We can rewrite Equation 6 as

$$V^{\pi}(i, s) = E_s^{\pi} \left\{ \sum_{n=0}^{N-1} e^{-\beta \sigma_n}\, r(s_n, a_n) + e^{-\beta \sigma_N} \sum_{n=0}^{\infty} e^{-\beta \sigma_n}\, r(s_{N+n}, a_{N+n}) \right\} \tag{7}$$

The first summation on the right-hand side of Equation 7 is the discounted sum of rewards
for executing subroutine $\pi_i(s)$ starting in state $s$ until it terminates; in other words, it is
$V^{\pi}(\pi_i(s), s)$, the projected value function for the child task $M_{\pi_i(s)}$. The second term on the
right-hand side of the equation is the value of $s'$ for the current task $i$, $V^{\pi}(i, s')$, discounted
by $e^{-\beta t}$, where $s'$ is the current state when subroutine $\pi_i(s)$ terminates and $t$ is the sample
transition time from state $s$ to state $s'$. We can write Equation 7 in the form of a Bellman
equation:

$$V^{\pi}(i, s) = V^{\pi}(\pi_i(s), s) + \sum_{s' \in S_i} P_i^{\pi}(s' \mid s, \pi_i(s)) \int_0^{\infty} e^{-\beta t}\, V^{\pi}(i, s')\, F_i(dt \mid s, \pi_i(s)) \tag{8}$$

Equation 8 can be re-stated for action-value function decomposition as follows:

$$Q^{\pi}(i, s, a) = V^{\pi}(a, s) + \sum_{s' \in S_i} P_i^{\pi}(s' \mid s, a) \int_0^{\infty} e^{-\beta t}\, Q^{\pi}(i, s', \pi_i(s'))\, F_i(dt \mid s, a)$$

The right-most term in this equation is the expected discounted cumulative reward of
completing task $M_i$ after executing action $a$ in state $s$. This term is called the completion
function and is denoted by $C^{\pi}(i, s, a)$. With this definition, we can express the Q function
recursively as

$$Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a)$$

and we can re-express the definition for V as

$$V^{\pi}(i, s) = \begin{cases} Q^{\pi}(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ k(s, i) + \int_0^{\infty} \sum_{s' \in S_i} P_i(s' \mid s, i) \left[ \int_0^{u} e^{-\beta t}\, c(s', s, i)\, dt \right] F_i(du \mid s, i) & \text{if } i \text{ is primitive} \end{cases}$$

We can use the above formulas to obtain update equations for the value function $V$, the
outside completion function $C$, and the inside completion function $\tilde{C}$ in the
continuous-time discounted reward model. Pseudo-code for the resulting algorithm is shown
in Algorithm 1 (Ghavamzadeh and Mahadevan, 2001). In Algorithm 1, Algorithm 2 and
Algorithm 3, we use the notation $u \xleftarrow{\alpha} v$ as an abbreviation for the
stochastic approximation update rule $u \leftarrow (1 - \alpha) u + \alpha v$.

Algorithm 1 The continuous-time discounted reward MAXQ algorithm.

function MAXQ(MaxNode i, State s)
    let Seq = {} be the sequence of (states visited, transition times) while executing i
    if i is a primitive MaxNode then
        execute action i in state s, observe state s' in $\tau$ time units, receive lump portion
            of reward k(s, i) and continuous portion of reward with rate r(s', s, i)
        $V_{t+1}(i, s) \xleftarrow{\alpha} [k(s, i) + r(s', s, i)]$
        push (state s, transition time $\tau$) into the beginning of Seq
    else
        while i has not terminated do
            choose action a according to the current exploration policy $\pi_i(s)$
            let ChildSeq = MAXQ(a, s), where ChildSeq is the sequence of (states visited,
                transition times) while executing action a
            observe result state s'
            let $a^* = \arg\max_{a' \in A_i(s')} [\tilde{C}_t(i, s', a') + V_t(a', s')]$
            T = 0
            for (s, $\tau$) in ChildSeq from the beginning do
                T = T + $\tau$
                $\tilde{C}_{t+1}(i, s, a) \xleftarrow{\alpha} e^{-\beta T} [\tilde{R}_i(s') + \tilde{C}_t(i, s', a^*) + V_t(a^*, s')]$
                $C_{t+1}(i, s, a) \xleftarrow{\alpha} e^{-\beta T} [C_t(i, s', a^*) + V_t(a^*, s')]$
            end for
            append ChildSeq onto the front of Seq
            s = s'
        end while
    end if
    return Seq
end MAXQ
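
The inner loop of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `update_along_child_seq`, the dictionary tables and the callable `V` are hypothetical, and `child_seq` is assumed to list (state, transition time) pairs with the most recently visited state first, so that the accumulated time is the time from each visited state to the termination of the child.

```python
import math
from collections import defaultdict


def update_along_child_seq(C_out, C_in, V, i, a, child_seq, s_next, actions,
                           pseudo_reward=0.0, alpha=0.1, beta=0.05):
    """Update the completion values of parent i for every state visited
    while child a was executing, discounting by elapsed continuous time."""
    a_star = max(actions, key=lambda b: C_in[(i, s_next, b)] + V(b, s_next))
    elapsed = 0.0
    for s, tau in child_seq:
        elapsed += tau
        discount = math.exp(-beta * elapsed)
        key = (i, s, a)
        target_in = discount * (pseudo_reward + C_in[(i, s_next, a_star)]
                                + V(a_star, s_next))
        target_out = discount * (C_out[(i, s_next, a_star)]
                                 + V(a_star, s_next))
        C_in[key] = (1.0 - alpha) * C_in[key] + alpha * target_in
        C_out[key] = (1.0 - alpha) * C_out[key] + alpha * target_out
    return elapsed


# Usage with made-up numbers: with beta = ln 2 the discount halves
# for every unit of elapsed time.
C_out, C_in = defaultdict(float), defaultdict(float)
total_time = update_along_child_seq(
    C_out, C_in, lambda b, s: 4.0, "Root", "NavT1",
    child_seq=[("s1", 1.0), ("s0", 2.0)], s_next="s2", actions=["Pick"],
    alpha=1.0, beta=math.log(2.0))
```

The only difference from the discrete-time rule is the discount factor: $\gamma^N$ is replaced by $e^{-\beta T}$, where $T$ is the sampled elapsed time rather than a step count.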
6. Discrete-Time Average Reward MAXQ Algorithm


We now describe a new discrete-time average reward hierarchical reinforcement learning
algorithm based on the MAXQ framework. To simplify exposition, we assume that for
every possible stationary policy of each subtask in the hierarchy, the embedded Markov
chain has a unichain transition probability matrix. Under this assumption every subtask
in the hierarchy is a unichain SMDP. This means the expected average reward of every
stationary policy for each subtask in the hierarchy does not vary with the initial state. As we
mentioned earlier, value function decomposition is the heart of the MAXQ method. We
show how the overall h function for a policy is decomposed into a collection of h functions
for individual subtasks in the discrete-time average reward MAXQ method. The projected
h function of hierarchical policy $\pi$ on subtask $M_i$, denoted $h^{\pi}(i, s)$, is the average adjusted
sum of rewards earned by following policy $\pi_i$ (and the policies of all descendants of $M_i$)
starting in state $s$ until $M_i$ terminates:

$$h^{\pi}(i, s) = \lim_{N \to \infty} E_s^{\pi} \left\{ \sum_{u=0}^{N-1} \left( r(s_u, a_u) - g^i \right) \right\} \tag{9}$$

where $g^i$ is the gain of subtask $M_i$. Now let us suppose that the first action chosen by $\pi_i$
is invoked and executes for a number of steps $N$ and terminates in state $s'$ according to
$P_i^{\pi}(s', N \mid s, a)$. We can write Equation 9 in the form of a Bellman equation:

$$h^{\pi}(i, s) = r(s, \pi_i(s)) - g^i\, y_i(s, \pi_i(s)) + \sum_{s' \in S_i,\, N} P_i^{\pi}(s', N \mid s, \pi_i(s))\, h^{\pi}(i, s') \tag{10}$$

Since $r(s, \pi_i(s))$ is the expected total reward between two decision epochs of subtask $i$, given
that the system occupies state $s$ at the first decision epoch, the decision maker chooses action
$\pi_i(s)$, and the number of time steps until the next decision epoch is $N$, we have

$$r(s, \pi_i(s)) = V^{\pi}_{y_i(s, \pi_i(s))}(\pi_i(s), s) = h^{\pi}(\pi_i(s), s) + g^{\pi_i(s)}\, y_i(s, \pi_i(s))$$

By replacing $r(s, \pi_i(s))$ from the above expression, Equation 10 can be written as

$$h^{\pi}(i, s) = h^{\pi}(\pi_i(s), s) - \left( g^i - g^{\pi_i(s)} \right) y_i(s, \pi_i(s)) + \sum_{s' \in S_i,\, N} P_i^{\pi}(s', N \mid s, \pi_i(s))\, h^{\pi}(i, s') \tag{11}$$

We can re-state Equation 11 for action-value function decomposition as follows:

$$R^{\pi}(i, s, a) = h^{\pi}(a, s) - (g^i - g^a)\, y_i(s, a) + \sum_{s' \in S_i,\, N} P_i^{\pi}(s', N \mid s, a)\, R^{\pi}(i, s', \pi_i(s'))$$

In the above equation, the term

$$-(g^i - g^a)\, y_i(s, a) + \sum_{s' \in S_i,\, N} P_i^{\pi}(s', N \mid s, a)\, R^{\pi}(i, s', \pi_i(s'))$$

denotes the average adjusted reward of completing task $M_i$ after executing action $a$ in state
$s$. This term is called the completion function and is denoted by $C^{\pi}(i, s, a)$. With this
definition, we can express the R function recursively as

$$R^{\pi}(i, s, a) = h^{\pi}(a, s) + C^{\pi}(i, s, a)$$

and we can re-express the definition for h as

$$h^{\pi}(i, s) = \begin{cases} R^{\pi}(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s' \mid s, i) \left[ r(s' \mid s, i) - g^i \right] & \text{if } i \text{ is primitive} \end{cases}$$

The above formulas can be used to obtain update equations for the h function, the outside
completion function $C$, and the inside completion function $\tilde{C}$ in the discrete-time average
reward model. Pseudo-code for the resulting algorithm is shown in Algorithm 2. As mentioned
above, all subtasks in the hierarchy, even primitive actions, are modeled by a unichain SMDP.
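
To make the role of the gain concrete, here is a minimal sketch of the two ingredients of the discrete-time average reward update: a running gain estimate (total reward divided by total steps) and the completion target with its gain-difference correction. The class `GainEstimate` and the helper names are hypothetical, and the gain bookkeeping is simplified relative to the pseudo-code in Algorithm 2.

```python
class GainEstimate:
    """Running estimate of a subtask's gain: total reward / total steps."""

    def __init__(self):
        self.total_reward = 0.0
        self.total_steps = 0

    def update(self, reward, steps=1):
        self.total_reward += reward
        self.total_steps += steps

    @property
    def value(self):
        if self.total_steps == 0:
            return 0.0
        return self.total_reward / self.total_steps


def completion_target(c_next, h_next, gain_parent, gain_child, n_steps):
    """C(i, s', a*) + h(a*, s') - (g^i - g^a) * N."""
    return c_next + h_next - (gain_parent - gain_child) * n_steps


def blend(old, target, alpha):
    """Stochastic approximation step: (1 - alpha) * old + alpha * target."""
    return (1.0 - alpha) * old + alpha * target


# Usage with made-up numbers.
gain = GainEstimate()
gain.update(2.0)
gain.update(4.0)
target = completion_target(c_next=1.0, h_next=2.0,
                           gain_parent=gain.value, gain_child=1.0, n_steps=2)
new_c = blend(0.0, target, alpha=0.5)
```

The correction term $(g^i - g^a) N$ plays the part that the discount factor $\gamma^N$ plays in the discounted model: it accounts for the $N$ steps the child consumed, measured against the parent's own average reward.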

7.  Continuous-Time  Average  Reward  MAXQ  Algorithm 

We now describe a new continuous-time average reward hierarchical reinforcement learning
algorithm based on the MAXQ framework. To simplify exposition, we assume that for every
possible stationary policy of each subtask in the hierarchy, the embedded Markov chain has
a unichain transition probability matrix. Under this assumption every subtask in the
hierarchy is a unichain SMDP. This means the expected average reward of every stationary
policy for each subtask in the hierarchy does not vary with the initial state. As we mentioned
earlier, value function decomposition is the heart of the MAXQ method. We show how the
overall h function for a policy is decomposed into a collection of h functions for individual
subtasks in the continuous-time average reward MAXQ method. The projected h function of
hierarchical policy $\pi$ on subtask $M_i$, denoted $h^{\pi}(i, s)$, is the average adjusted sum of
rewards earned by following policy $\pi_i$ (and the policies of all descendants of $M_i$) starting
in state $s$ until $M_i$ terminates:

$$h^{\pi}(i, s) = \lim_{N \to \infty} E_s^{\pi} \left\{ \sum_{t=0}^{N-1} \left( r(s_t, a_t) - g^i \tau_t \right) \right\} \tag{12}$$

where the $\tau_t$ are the lengths of the decision epochs and $g^i$ is the gain of subtask $M_i$. Now
let us suppose that the first action chosen by $\pi_i$ is invoked and executes for a number of
steps and terminates in state $s'$ according to $P_i^{\pi}(s' \mid s, a)$. We can write Equation 12 in the
form of a Bellman equation:

$$h^{\pi}(i, s) = r(s, \pi_i(s)) - g^i\, y_i(s, \pi_i(s)) + \sum_{s' \in S_i} P_i^{\pi}(s' \mid s, \pi_i(s)) \int_0^{\infty} h^{\pi}(i, s')\, F_i(dt \mid s, \pi_i(s)) \tag{13}$$

Since $r(s, \pi_i(s))$ is the expected total reward between two decision epochs of subtask $i$, given
that the system occupies state $s$ at the first decision epoch, the decision maker chooses action
$\pi_i(s)$, and the expected length of time until the next decision epoch is $y_i(s, \pi_i(s))$, we have

$$r(s, \pi_i(s)) = V^{\pi}_{y_i(s, \pi_i(s))}(\pi_i(s), s) = h^{\pi}(\pi_i(s), s) + g^{\pi_i(s)}\, y_i(s, \pi_i(s))$$

By replacing $r(s, \pi_i(s))$ from the above expression, Equation 13 can be written as

$$h^{\pi}(i, s) = h^{\pi}(\pi_i(s), s) - \left( g^i - g^{\pi_i(s)} \right) y_i(s, \pi_i(s)) + \sum_{s' \in S_i} P_i^{\pi}(s' \mid s, \pi_i(s)) \int_0^{\infty} h^{\pi}(i, s')\, F_i(dt \mid s, \pi_i(s)) \tag{14}$$

Algorithm 2 The discrete-time average reward MAXQ algorithm.

function MAXQ(MaxNode i, State s)
    let Seq = {} be the sequence of (states visited, transition times, reward) while executing i
    if i is a primitive MaxNode then
        execute action i in state s, receive reward r, and observe state s'
        $h_{t+1}(i, s) \leftarrow (1 - \alpha)\, h_t(i, s) + \alpha\, (r(s', s, i) - g_t^i)$
        if i is a non-random action then
            update average reward or gain of subtask i:
            $g_{t+1}^i = \frac{r_{t+1}(i)}{n_{t+1}(i)} = \frac{r_t(i) + r}{n_t(i) + 1}$
        end if
        push (state s, reward r) into the beginning of Seq
    else
        while i has not terminated do
            choose action a according to the current exploration policy $\pi_i(s)$
            let ChildSeq = MAXQ(a, s), where ChildSeq is the sequence of (states visited,
                transition times) while executing action a
            observe result state s'
            let $a^* = \arg\max_{a' \in A_i(s')} [\tilde{C}_t(i, s', a') + h_t(a', s')]$
            let N = 1; R = 0
            for each (s, r) in ChildSeq from the beginning do
                R = R + r
                $\tilde{C}_{t+1}(i, s, a) \xleftarrow{\alpha} [\tilde{R}_i(s') - (g_t^i - g_t^a) N + \tilde{C}_t(i, s', a^*) + h_t(a^*, s')]$
                $C_{t+1}(i, s, a) \xleftarrow{\alpha} [C_t(i, s', a^*) + h_t(a^*, s') - (g_t^i - g_t^a) N]$
                N = N + 1
                if a is a non-random action then
                    update average reward or gain of subtask i:
                    $g_{t+1}^i = \frac{r_{t+1}(i)}{n_{t+1}(i)} = \frac{r_t(i) + R}{n_t(i) + N}$
                end if
            end for
            append ChildSeq onto the front of Seq
            s = s'
        end while
    end if
    return Seq
end MAXQ

We can re-state Equation 14 for action-value function decomposition as follows:

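$$R^{\pi}(i, s, a) = h^{\pi}(a, s) - (g^i - g^a)\, y_i(s, a) + \sum_{s' \in S_i} P_i^{\pi}(s' \mid s, a) \int_0^{\infty} R^{\pi}(i, s', \pi_i(s'))\, F_i(dt \mid s, a)$$

In the above equation, the term

$$-(g^i - g^a)\, y_i(s, a) + \sum_{s' \in S_i} P_i^{\pi}(s' \mid s, a) \int_0^{\infty} R^{\pi}(i, s', \pi_i(s'))\, F_i(dt \mid s, a)$$

denotes the average adjusted reward of completing task $M_i$ after executing action $a$ in state
$s$. This term is called the completion function and is denoted by $C^{\pi}(i, s, a)$. With this
definition, we can express the R function recursively as

$$R^{\pi}(i, s, a) = h^{\pi}(a, s) + C^{\pi}(i, s, a)$$

and we can re-express the definition for h as

$$h^{\pi}(i, s) = \begin{cases} R^{\pi}(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ k(s, i) + \int_0^{\infty} \sum_{s' \in S_i} P_i(s' \mid s, i) \left[ \int_0^{u} c(s', s, i)\, dt \right] F_i(du \mid s, i) - g^i \sum_{s' \in S_i} P_i(s' \mid s, i) \int_0^{\infty} t\, F_i(dt \mid s, i) & \text{if } i \text{ is primitive} \end{cases}$$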

The above formulas can be used to obtain update equations for the h function, the outside
completion function $C$, and the inside completion function $\tilde{C}$ in the continuous-time
average reward model. Pseudo-code for the resulting algorithm is shown in Algorithm 3
(Ghavamzadeh and Mahadevan, 2001). As mentioned above, all subtasks in the hierarchy,
even primitive actions, are modeled by a unichain SMDP.
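
For a primitive action, the continuous-time average reward update reduces to simple bookkeeping over one sampled transition. The sketch below assumes a constant reward rate over the transition and uses hypothetical names (`primitive_update`, the `stats` dictionary with keys `reward` and `time`); the `greedy` flag stands in for the "non-random action" test in the pseudo-code.

```python
def primitive_update(h, stats, i, s, lump, rate, tau, alpha=0.1, greedy=True):
    """One update of h(i, s) for primitive action i that took tau time units.

    lump is the lump-sum reward k(s, i); rate is the continuous reward rate.
    stats holds the running totals used to estimate the gain of i.
    """
    gain = stats["reward"] / stats["time"] if stats["time"] > 0 else 0.0
    reward = lump + rate * tau          # total reward earned on the transition
    target = reward - gain * tau        # average adjusted reward
    h[(i, s)] = (1.0 - alpha) * h.get((i, s), 0.0) + alpha * target
    if greedy:
        # The gain is reward per unit time, so both totals are accumulated.
        stats["reward"] += reward
        stats["time"] += tau
    return reward


# Usage with made-up numbers.
h_table = {}
stats = {"reward": 3.0, "time": 2.0}
earned = primitive_update(h_table, stats, "Load", "s0",
                          lump=1.0, rate=2.0, tau=3.0, alpha=0.5)
new_gain = stats["reward"] / stats["time"]
```

Compared with the discrete-time case, the step count is replaced by elapsed time, so the gain is expressed as reward per unit time rather than reward per step.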


8.  Multiagent  MAXQ  Algorithm  (Cooperative  MAXQ) 

The MAXQ decomposition of the Q-function relies on a key principle: the reward function
for the parent task is the value function of the child task (see Equation 5). We show how
this idea can be extended to joint-action values. The most salient feature of the extended
MAXQ algorithm, which is proposed in this section, is that the top level(s) (the level
immediately below the root, and perhaps lower levels) of the hierarchy is (are) configured
to store the completion function (C) values for joint (abstract) actions of all agents. The
completion function $C^j(i, s, a^1, a^2, \ldots, a^j, \ldots, a^n)$ is defined as the expected discounted
reward of completion of subtask $a^j$ by agent $j$ in the context of the other agents performing
subtasks $a^i$, $\forall i \in \{1, \ldots, n\}$, $i \neq j$ (Makar et al., 2001).

More precisely, the decomposition equations used for calculating the projected value
function V have the following form (for agent $j$):

$$V^j(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n) = \begin{cases} \max_{a^j} Q^j(i, s, a^1, \ldots, a^j, \ldots, a^n) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive} \end{cases}$$

$$Q^j(i, s, a^1, \ldots, a^j, \ldots, a^n) = V^j(a^j, s) + C^j(i, s, a^1, \ldots, a^j, \ldots, a^n) \tag{15}$$

These equations apply at the highest (or lower than the highest, as needed) level(s) of the
hierarchy, where joint action values are being modeled, and $a^j$ is the action being performed
by agent $j$. Compare the decomposition in Equation 5 with Equation 15. Given a MAXQ
hierarchy for any given task, we need to find the highest level at which this equation provides
a sufficiently good approximation of the true value. For both the AGV and the trash
collection domain, the subtasks immediately below the root seem to be a good compromise
between good performance and reducing the number of joint state-action values that need to
be learned.

Algorithm 3 The continuous-time average reward MAXQ algorithm.

function MAXQ(MaxNode i, State s)
    let Seq = {} be the sequence of (states visited, transition times, reward) while executing i
    if i is a primitive MaxNode then
        execute action i in state s, observe state s' in $\tau$ time units, receive lump portion
            of reward k(s, i) and continuous portion of reward with rate r(s', s, i)
        $h_{t+1}(i, s) \xleftarrow{\alpha} [k(s, i) + r(s', s, i)\tau - g_t^i \tau]$
        if i is a non-random action then
            update average reward or gain of subtask i:
            $g_{t+1}^i = \frac{r_{t+1}(i)}{t_{t+1}(i)} = \frac{r_t(i) + k(s, i) + r(s', s, i)\tau}{t_t(i) + \tau}$
        end if
        push (state s, transition time $\tau$, reward $\rho = k(s, i) + r(s', s, i)\tau$) into the
            beginning of Seq
    else
        while i has not terminated do
            choose action a according to the current exploration policy $\pi_i(s)$
            let ChildSeq = MAXQ(a, s), where ChildSeq is the sequence of (states visited,
                transition times) while executing action a
            observe result state s'
            let $a^* = \arg\max_{a' \in A_i(s')} [\tilde{C}_t(i, s', a') + h_t(a', s')]$
            T = 0; R = 0
            for (s, $\tau$, $\rho$) in ChildSeq from the beginning do
                T = T + $\tau$; R = R + $\rho$
                $\tilde{C}_{t+1}(i, s, a) \xleftarrow{\alpha} [\tilde{R}_i(s') - (g_t^i - g_t^a) T + \tilde{C}_t(i, s', a^*) + h_t(a^*, s')]$
                $C_{t+1}(i, s, a) \xleftarrow{\alpha} [C_t(i, s', a^*) + h_t(a^*, s') - (g_t^i - g_t^a) T]$
                if a is a non-random action then
                    update average reward or gain of subtask i:
                    $g_{t+1}^i = \frac{r_{t+1}(i)}{t_{t+1}(i)} = \frac{r_t(i) + R}{t_t(i) + T}$
                end if
            end for
            append ChildSeq onto the front of Seq
            s = s'
        end while
    end if
    return Seq
end MAXQ

To illustrate the multiagent MAXQ algorithm, for the two-robot trash collection task,
if we set up the joint action-values at only the highest level of the MAXQ graph, we get the
following value function decomposition for Agent1:

$$Q^1(Root, s, NavT1, NavT2) = V^1(NavT1, s) + C^1(Root, s, NavT1, NavT2)$$

which represents the value of Agent1 doing task NavT1 in the context of the overall Root
task, when Agent2 is doing task NavT2. Note that this value is decomposed into the value
of the NavT1 subtask itself and the completion cost of the remainder of the overall task. In
this example, the multiagent MAXQ decomposition embodies the heuristic that the value
of Agent1 doing the subtask NavT1 is independent of whatever Agent2 is doing.

A recursive algorithm is used for learning the C values. An agent starts from
the root task and recursively chooses subtasks until it reaches a primitive action. The
primitive action is executed, the reward is observed, and the leaf V values are updated.
Whenever any subtask terminates, the $C(i, s, a)$ values are updated for all states visited
during the execution of that subtask. Similarly, when one of the tasks at the level just below
the root task terminates, the $C(i, s, a^1, \ldots, a^n)$ values are updated according to the MAXQ
learning algorithm.
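
The joint-action decomposition of Equation 15 can be illustrated with a small lookup-table sketch for the two-robot case. The names `C_joint`, `joint_q`, `best_own_action` and `v_nav`, the state label `s0`, and all numbers are hypothetical; the point is only that the completion table is indexed by the other agent's action, while the value of the agent's own subtask is not.

```python
from collections import defaultdict

# C_joint[(task, state, own_action, others_actions)]: joint completion values
C_joint = defaultdict(float)


def joint_q(v_own, task, state, own_action, others_actions):
    """Q^j = V^j(a^j, s) + C^j(i, s, a^1, ..., a^n), as in Equation 15."""
    key = (task, state, own_action, tuple(others_actions))
    return v_own(own_action, state) + C_joint[key]


def best_own_action(v_own, task, state, own_choices, others_actions):
    """Greedy choice for one agent, given what the other agents are doing."""
    return max(own_choices,
               key=lambda a: joint_q(v_own, task, state, a, others_actions))


# Illustrative numbers only: heading for the same trash can as the other
# robot leaves more work to finish afterwards, so its completion value is lower.
C_joint[("Root", "s0", "NavT1", ("NavT1",))] = -10.0
C_joint[("Root", "s0", "NavT2", ("NavT1",))] = -4.0


def v_nav(action, state):
    # Hypothetical value of the navigation subtask itself; it does not
    # depend on the other agent's action.
    return -2.0


choice = best_own_action(v_nav, "Root", "s0", ["NavT1", "NavT2"], ["NavT1"])
```

Because only the completion table carries the other agent's action, the number of joint values grows with the number of high-level subtasks rather than with the number of primitive joint actions.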

9.  The  AGV  Scheduling  Task 

Automated  Guided  Vehicles  (AGVs)  are  used  in  flexible  manufacturing  systems  (FMS)  for 
material  handling  (Askin  and  Standridge,  1993).  They  are  typically  used  to  pick  up  parts 
from  one  location,  and  drop  them  off  at  another  location  for  further  processing.  Locations 
correspond to workstations or storage locations. Loads that are released at the dropoff
point of a workstation wait at its pickup point after processing is over, so that an AGV
can take them to the warehouse or to some other location. The pickup point is the machine or
workstation’s output buffer. Any FMS using AGVs faces the problem of optimally
scheduling the paths of AGVs in the system (Lee, 1996). For example, a move request occurs
when  a  part  finishes  at  a  workstation.  If  more  than  one  vehicle  is  empty,  the  vehicle  which 
would  service  this  request  needs  to  be  selected.  Also,  when  a  vehicle  becomes  available, 
and  multiple  move  requests  are  queued,  a  decision  needs  to  be  made  as  to  which  request 
should  be  serviced  by  that  vehicle.  These  schedules  obey  a  set  of  constraints  that  reflect 
the  temporal  relationships  between  activities  and  the  capacity  limitations  of  a  set  of  shared 
resources. 

The uncertain and ever-changing nature of the manufacturing environment makes it
virtually impossible to plan moves ahead of time. Hence, AGV scheduling requires dynamic
dispatching rules, which depend on the state of the system, such as the number of parts
in each buffer, the state of the AGV, and the processing going on at the workstations. The
system performance is generally measured in terms of the throughput, the online inventory,
the AGV travel time and the flow time, but the throughput is by far the most important
factor. In this case, the throughput is measured in terms of the number of finished assemblies
deposited at the unloading deck per unit time. Since this problem is analytically intractable,
various heuristics and their combinations are generally used to schedule AGVs (Klein and
Kim, 1996; Lee, 1996). However, the heuristics perform poorly when the constraints on the
movement of the AGVs are reduced.

Figure 4: A multiple automatic guided vehicle (AGV) optimization task. There are four
AGV agents (not shown) which carry raw materials and finished parts between
the machines and the warehouse.

Previously, Tadepalli and Ok (1996b) studied a single-agent AGV
scheduling task using “flat” average-reward reinforcement learning. However, the
multiagent AGV task we study is more complex. Figure 4 shows the layout of the system used
for experimental purposes in this paper. M1 to M4 show workstations in this environment.
Parts of type $i$ have to be carried to the drop off station at workstation $i$, $D_i$, and the
assembled parts brought back from the pick up stations of the workstations, the $P_i$'s, to the
warehouse. The AGV travel is unidirectional (as the arrows show).

The termination predicate has been redefined to take care of the fact that the completion
of certain tasks might depend on the occurrence of an event rather than just a state of the
environment. For example, if we consider the DM1 subtask in the AGV problem (see
Figure 5), the state of the system at the beginning of the subtask might be the same as that
at the end, as the system is very dynamic. New parts continuously arrive at the warehouse,
and the machines start and end work on parts at random intervals. Also, the actions of
a number of agents affect the environment. This kind of discrete-event model makes it
necessary to define the termination of subtasks in terms of events. Hence, a subtask
terminates when the event associated with that subtask is triggered by the robot performing
the subtask; for example, the DM1 subtask terminates when the “unload of material 1 at drop
off station of machine 1” event occurs.

Figure 5: MAXQ graph for the AGV scheduling task.


9.1  State  Abstraction 


The state of the environment consists of the number of parts in the pickup station and in
the dropoff station of each machine, and whether the warehouse contains parts of each of
the four types. In addition, each agent keeps track of its own location and state as a part
of the state space. Thus, in the flat case, the state space comprises about 100 locations, 3
parts in each buffer, 9 possible states of the AGV (carrying Part1, ..., carrying Assembly1,
..., Empty), and 2 values for each part in the warehouse, i.e.,
$100 \times 4^8 \times 9 \times 2^4 \approx 2^{30}$ states, which is enormous. The MAXQ state
abstraction helps in reducing the state space considerably. Only the relevant state variables
are used while storing the completion functions in each node of the task graph. For example,
for the Navigate subtask, only the location state variable is relevant, and this subtask can be
learned with 100 values. Hence, for the highest level actions DM1, ..., DM4, the number of
relevant states would be $100 \times 9 \times 4 \times 2 \approx 2^{13}$, and for the highest level
actions DA1, ..., DA4, the number of states would be $100 \times 9 \times 4 \approx 2^{12}$.
For the lower level state space, the action with the largest state space is Navigate with 100
values. This state abstraction gives us a compact way of representing the C functions, and
speeds up the algorithm (Dietterich, 2000).
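
The state counts quoted above can be checked with a few lines of arithmetic. The variable names below are ours; the factors are the ones given in the text.

```python
import math

# Flat case: locations x buffer contents x AGV states x warehouse flags.
flat_states = 100 * 4**8 * 9 * 2**4

# Abstracted state counts for the highest level actions.
dm_states = 100 * 9 * 4 * 2   # DM1, ..., DM4
da_states = 100 * 9 * 4       # DA1, ..., DA4

flat_bits = math.log2(flat_states)
dm_bits = math.log2(dm_states)
da_bits = math.log2(da_states)
reduction = flat_states / dm_states
```

With these factors the flat space has 943,718,400 states, while the largest abstracted space has 7,200, a reduction by a factor of 131,072, or $2^{17}$.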
10. Experimental Results

We  first  describe  experiments  in  the  simple  two-robot  trash  collection  problem,  and  then 
we  will  turn  to  the  more  complex  multiagent  AGV  task  and  apply  the  cooperative  discrete 
and  continuous  time  MAXQ  algorithms  to  this  problem. 

10.1  Trash  Collection  Task  (Discrete  Time  Model) 

We  first  provide  more  details  of  how  we  implemented  the  trash  collection  task.  In  the  single 
agent  scenario,  one  robot  starts  in  the  middle  of  Room  1  and  learns  the  task  of  picking  up 
trash  from  T1  and  T2  and  depositing  it  into  the  Dump.  The  goal  state  is  reached  when  trash 
from  both  T1  and  T2  has  been  deposited  in  Dump.  The  state  space  here  is  the  orientation 
of  the  robot  (N,S,W,E),  and  another  component  based  on  its  percept.  We  assume  that  a 
ring  of  16  sonars  would  enable  the  robot  to  find  out  whether  it  is  in  a  corner,  (with  two 
walls  perpendicular  to  each  other  on  two  sides  of  the  robot),  near  a  wall  (with  wall  only  on 
one  side),  near  a  door  (wall  on  either  side  of  an  opening),  in  a  corridor  (parallel  walls  on 
either side) or in an open area (the middle of the room). Thus, each room is divided into 9
states and the corridor into 4 states, giving ((9 x 3) + 4) x 4, or 124, locations for
a robot. Also, the trash object from trash basket T1 can be at T1, carried by the robot, or
at Dump, and the trash object from trash basket T2 can be at T2, carried by the robot, or at
Dump. Thus the total number of environment states is 124 x 3 x 3, or 1116, for the single
agent case. In the two-agent case, each trash object can be at its basket (T1 or T2), at
Dump, or carried by one of the two robots. Thus, in the flat case, the size of the state
space would grow to 124 x 124 x 4 x 4 = 246,016, or about 24 x 10^4.
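
The counting argument above can be reproduced directly. This is a small sketch under the assumptions stated in the text (three rooms of 9 percept states each, a corridor of 4, four orientations, two trash objects); the function name `flat_state_count` is hypothetical.

```python
ROOMS = 3
STATES_PER_ROOM = 9
CORRIDOR_STATES = 4
ORIENTATIONS = 4  # N, S, W, E

# ((9 x 3) + 4) x 4 locations for one robot
ROBOT_LOCATIONS = (STATES_PER_ROOM * ROOMS + CORRIDOR_STATES) * ORIENTATIONS


def flat_state_count(num_robots, num_trash=2):
    """Size of the flat (non-hierarchical) state space.

    Each trash object is at its own basket, at Dump, or carried by
    one of the robots, giving 2 + num_robots possibilities per object.
    """
    places_per_trash = 2 + num_robots
    return ROBOT_LOCATIONS ** num_robots * places_per_trash ** num_trash


print(ROBOT_LOCATIONS)        # locations for a single robot
print(flat_state_count(1))    # single-agent case
print(flat_state_count(2))    # two-agent flat case
```

The robot-location factor is raised to the number of robots, so the flat state space grows exponentially with the number of agents, which is the motivation for the hierarchical decomposition.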

The  environment  is  fully  observable  given  this  state  decomposition,  as  the  direction 
which  the  robot  is  facing,  in  combination  with  the  percept  (which  includes  the  room  the 
agent  is  in)  gives  a  unique  value  for  each  location.  The  primitive  actions  considered  here 
are  behaviors  to  find  a  wall  in  one  of  the  four  directions,  align  with  the  wall  on  left  or  right 
side,  follow  wall,  enter  or  exit  door,  align  south  or  north  in  the  corridor,  or  move  in  the 
corridor. 

In  the  two-robot  trash  collection  task,  examination  of  the  learned  policy  in  Figure  6 
reveals  that  the  robots  have  nicely  learned  all  three  skills:  how  to  achieve  a  subtask,  what 
order  to  do  them  in,  and  how  to  coordinate  with  other  agents.  In  addition,  as  Figure  7 
confirms,  the  number  of  steps  needed  to  do  the  trash  collection  task  is  greatly  reduced  when 
the  two  agents  coordinate  to  do  the  task,  compared  to  when  a  single  agent  attempts  to  carry 
out  the  whole  task. 

10.2  AGV  Domain  (Discrete-Time  Model) 

We  now  present  detailed  experimental  results  on  the  AGV  scheduling  task,  comparing 
several  learning  agents,  including  a  single  agent  using  MAXQ,  selfish  multiple  agents  using 
MAXQ  (where  each  agent  acts  independently  and  learns  its  own  optimal  policy),  and  the 
new  cooperative  multiagent  MAXQ  approach.  In  this  domain,  there  are  four  agents  (each 
AGV  is  an  agent). 

Learned Policy for Agent 1

root
    navigate to trash 1
        go to location of trash 1 in room 1
    pick trash 1
    navigate to bin
        exit room 1
        enter room 3
        go to location of dump in room 3
    put trash 1 in dump
    end


Learned Policy for Agent 2

root
    navigate to trash 2
        go to location of trash 2 in room 1
    pick trash 2
    navigate to bin
        exit room 1
        enter room 3
        go to location of dump in room 3
    put trash 2 in dump
    end


Figure 6: This figure shows the policy learned by the cooperative multiagent MAXQ algorithm
in the trash collection task.


The experimental results were generated with the following model parameters. The
inter-arrival time for parts at the warehouse is uniformly distributed with a mean of 4 sec
and variance of 1 sec. The percentages of Part1, Part2, Part3 and Part4 in the part arrival
process are 20, 28, 22 and 30 respectively. The time required for assembling the various
parts is uniformly distributed with means 15, 24, 24 and 30 sec for Part1, Part2, Part3
and Part4 respectively, and variance 2 sec. The execution time of primitive actions (AGV
navigation actions, load and unload) is 1000 micro sec. The execution time for the idle
action is 1 sec. Each experiment was conducted five times and the results averaged.
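
To make these parameters concrete, the sketch below derives the endpoints of a uniform distribution from its mean and variance, and computes the average assembly time implied by the part mix. It is an illustration only. It assumes that the stated "variance" is the statistical variance (in squared time units), and the names `uniform_bounds`, `PART_PERCENT` and `ASSEMBLY_MEAN_SEC` are hypothetical.

```python
import math
import random


def uniform_bounds(mean, variance):
    """Endpoints (a, b) of a uniform distribution with this mean and variance.

    Uses Var[U(a, b)] = (b - a)**2 / 12, so the half-width is sqrt(3 * variance).
    """
    half_width = math.sqrt(3.0 * variance)
    return mean - half_width, mean + half_width


# Part mix (percent) and mean assembly times (sec), copied from the text.
PART_PERCENT = {"Part1": 20, "Part2": 28, "Part3": 22, "Part4": 30}
ASSEMBLY_MEAN_SEC = {"Part1": 15, "Part2": 24, "Part3": 24, "Part4": 30}

# Inter-arrival time: mean 4 sec, variance 1.
low, high = uniform_bounds(4.0, 1.0)
rng = random.Random(0)
arrivals = [rng.uniform(low, high) for _ in range(1000)]

# Average assembly time over the part arrival process.
expected_assembly_sec = sum(
    PART_PERCENT[p] * ASSEMBLY_MEAN_SEC[p] for p in PART_PERCENT
) / 100

print(round(low, 2), round(high, 2))
print(expected_assembly_sec)
```

Under these assumptions, the average assembly time over the part mix is 24 sec, so a primitive action of 1000 micro sec is negligible relative to the assembly time.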

Figure 8 shows the throughput of the system for the three types of approaches. As seen
in Figure 8, the agents learn a little faster initially in the selfish multiagent method, but
after some time, undulations appear in the graph, showing not only that the algorithm does
not stabilize, but also that it results in sub-optimal performance. This happens because
two or more agents may select the same action; once the first agent completes the task, the
other agents might have to wait a long time to complete it, due to the constraints on the
number of parts that can be stored at a particular place. The system throughput achieved
using the new cooperative multiagent MAXQ method is significantly higher than in the
single agent or selfish multiagent case. This difference is even more significant in Figure 9,
where the primitive actions have a longer execution time relative to the average assembly
time (the execution time of primitive actions is 2 sec).

Figure 7: Number of actions needed to complete the trash collection task.

Figure 8: This figure shows that the cooperative multiagent MAXQ approach outperforms
both the selfish (non-cooperative) and single agent MAXQ approaches when the
AGV travel time is very much less than the assembly time. Learning curves are
averaged over five runs. (x-axis: time since start of simulation, in sec.)

Figure 9: This figure compares the cooperative multiagent MAXQ approach with the selfish
(non-cooperative) MAXQ approach, when the AGV travel time and load/unload
time is a larger fraction of the average assembly time (primitive actions take
2 sec). Learning curves are averaged over five runs.

Figure 10 shows results from an implementation of a single flat Q-learning agent with
the buffer capacity at each station set at 1. As can be seen from the plot, the flat algorithm
converges extremely slowly. The throughput at 70,000 sec has gone up to only 0.07,
compared with 2.6 for the hierarchical single agent case. Figure 11 compares the cooperative
multiagent MAXQ algorithm with several well-known AGV scheduling rules, showing
clearly the improved performance of the reinforcement learning method. Finally, Figure 12
shows that when the Q-nodes at the top two levels of the hierarchy are configured to represent
joint action-values, learning is considerably slower (since the number of parameters is
increased significantly), and the overall performance is not better. The lack of improvement
is due in part to the fact that the second layer of the MAXQ hierarchy is concerned with
navigation. Adding joint actions does not help improve navigation because coordination is
not necessary in this environment. However, it might turn out that adding joint actions in
multiple layers will be worthwhile, even if convergence is slower, due to better overall task
performance.
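
The growth in the number of parameters can be illustrated with a rough count. The sketch below is not the authors' implementation. It assumes that, at a node with joint action-values, each agent indexes its values by its own action together with the actions of all other agents, and that all 8 highest-level actions (DM1..DM4 and DA1..DA4) are available to each of the four agents. The function name `q_entries` is hypothetical.

```python
def q_entries(num_states, num_actions, num_agents, joint):
    """Number of action-value entries one agent stores at a single node.

    joint=False: values indexed by (state, own action).
    joint=True:  values indexed by (state, own action, other agents' actions).
    """
    if joint:
        return num_states * num_actions ** num_agents
    return num_states * num_actions


NUM_AGENTS = 4         # four AGVs
TOP_LEVEL_ACTIONS = 8  # DM1..DM4 and DA1..DA4

per_state_selfish = q_entries(1, TOP_LEVEL_ACTIONS, NUM_AGENTS, joint=False)
per_state_joint = q_entries(1, TOP_LEVEL_ACTIONS, NUM_AGENTS, joint=True)
print(per_state_selfish, per_state_joint, per_state_joint // per_state_selfish)
```

Under these assumptions every additional level with joint actions multiplies the table size at that level by the number of actions raised to the number of other agents, which is consistent with the slower learning observed when two levels use joint action-values.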

10.3  AGV  Domain  (Continuous-Time  Model) 

We now apply the two proposed continuous-time algorithms described in Section 4 to the
AGV scheduling task and compare their performance and speed with each other, as well as
with several well-known AGV scheduling heuristics.

The experimental results were generated with the following model parameters. The
inter-arrival time for parts at the warehouse is uniformly distributed with a mean of 4 sec
and variance of 1 sec. The percentages of Part1, Part2, Part3 and Part4 in the part arrival
process are 20, 28, 22 and 30 respectively. The time required for assembling the various
parts is normally distributed with means 15, 24, 24 and 30 sec for Part1, Part2, Part3
and Part4 respectively, and variance 2 sec. The execution time of primitive actions
(AGV navigation actions, load and unload) is normally distributed with mean 1000 micro
sec and variance 50 micro sec. The execution time of the idle action is normally distributed
with mean 1 sec and variance 0.1 sec. Table 1 shows the values of all the parameters of the
continuous-time model used in the experimental results of this section. Each experiment
was conducted five times and the results averaged.
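
As an illustration of how the normally distributed durations might be simulated, the sketch below samples from the parameters listed above. It is an assumption-laden example, not the simulator used in the experiments. It treats the stated "variance" as the statistical variance in the squared stated unit, keeps each parameter in its own unit, and rejects non-positive draws, a truncation choice the text does not specify. The names `NORMAL_PARAMS` and `sample_duration` are hypothetical.

```python
import math
import random

# (mean, variance, unit) for the normally distributed parameters in the text.
NORMAL_PARAMS = {
    "idle_action":      (1.0, 0.1, "sec"),
    "primitive_action": (1000.0, 50.0, "micro sec"),
    "assembly_part1":   (15.0, 2.0, "sec"),
    "assembly_part2":   (24.0, 2.0, "sec"),
    "assembly_part3":   (24.0, 2.0, "sec"),
    "assembly_part4":   (30.0, 2.0, "sec"),
}


def sample_duration(name, rng):
    """Draw one positive duration, in the unit listed for that parameter."""
    mean, variance, _unit = NORMAL_PARAMS[name]
    std = math.sqrt(variance)
    while True:
        value = rng.gauss(mean, std)
        if value > 0.0:  # assumption: durations must be positive
            return value


rng = random.Random(0)
samples = [sample_duration("assembly_part4", rng) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to the stated mean of 30 sec
```

With these parameters, rejections are rare because the means are several standard deviations above zero.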

Figure 10: A flat Q-learner learns the AGV domain extremely slowly, showing the need
for using a hierarchical task structure. Note that the y axis of this plot is greatly
expanded compared to Figure 8.


Figure 11: This plot shows that the multiagent MAXQ outperforms three well-known, widely
used (industrial) heuristics for AGV scheduling. Learning curves are averaged
over five runs.

Figure 12: This plot compares the performance of the multiagent MAXQ algorithm with
joint actions at the top level vs. joint actions at the top two levels. Learning
curves are averaged over five runs.


Table 1: Model Parameters

Parameter                      Type of Distribution   Mean               Variance
Idle Action                    Normal                 1 (sec)            0.1 (sec)
Primitive Actions              Normal                 1000 (micro sec)   50 (micro sec)
Assembly Time for Part1        Normal                 15 (sec)           2 (sec)
Assembly Time for Part2        Normal                 24 (sec)           2 (sec)
Assembly Time for Part3        Normal                 24 (sec)           2 (sec)
Assembly Time for Part4        Normal                 30 (sec)           2 (sec)
Inter-Arrival Time for Parts   Uniform                4 (sec)            1 (sec)

Figure 13: This plot shows that the continuous-time average reward multiagent MAXQ
algorithm outperforms the continuous-time discounted reward multiagent MAXQ
algorithm. Learning curves are averaged over five runs.

Figure 14: This plot shows that both the continuous-time average reward and discounted
reward multiagent MAXQ algorithms outperform three well-known, widely used
(industrial) heuristics for AGV scheduling. Learning curves are averaged over five runs.


Figure 13 shows the throughput of the system for the continuous-time discounted and
average reward MAXQ algorithms proposed in this paper. As seen in this figure, the agents
learn a little faster initially in the discounted reward method, but the final system
throughput achieved using the average reward algorithm is higher than in the discounted
reward case.

Figure 14 compares the proposed continuous-time MAXQ algorithms with several
well-known AGV scheduling rules (highest queue first, nearest station first, and first come
first serve), showing clearly the improved performance of the reinforcement learning methods.
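
For readers unfamiliar with these scheduling rules, the sketch below shows one common reading of each. The text names the rules but does not define them, so the definitions, the `Request` record and its fields are all assumptions made for illustration, not the baselines as implemented in the experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical pending pickup request seen by an idle AGV."""
    station: str
    queue_length: int    # parts waiting at the station
    distance: float      # travel distance from the AGV to the station
    request_time: float  # time at which the request was raised


def highest_queue_first(requests):
    """Serve the station with the most waiting parts."""
    return max(requests, key=lambda r: r.queue_length)


def nearest_station_first(requests):
    """Serve the station closest to the AGV."""
    return min(requests, key=lambda r: r.distance)


def first_come_first_serve(requests):
    """Serve the oldest outstanding request."""
    return min(requests, key=lambda r: r.request_time)


pending = [
    Request("P1", queue_length=3, distance=9.0, request_time=12.0),
    Request("P2", queue_length=1, distance=2.0, request_time=20.0),
    Request("P3", queue_length=2, distance=5.0, request_time=7.0),
]
print(highest_queue_first(pending).station)
print(nearest_station_first(pending).station)
print(first_come_first_serve(pending).station)
```

Each rule uses a single fixed criterion, whereas the learned policies can trade these criteria off against each other depending on the state.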




Ghavamzadeh,  Mahadevan  &  Makar 


11.  Conclusion  and  Future  Work 

This paper extended the framework of hierarchical reinforcement learning to continuous-time,
average-reward, and multiagent domains. Using the MAXQ framework as an example,
we described three new hierarchical reinforcement learning algorithms: continuous-time
discounted reward MAXQ, discrete-time average reward MAXQ, and continuous-time average
reward MAXQ. We also extended the MAXQ framework to the multiagent case (cooperative
MAXQ), where each agent uses the same task hierarchy. Learning is decentralized, with
each agent learning three interrelated skills: how to perform subtasks, which order to do
them in, and how to coordinate with other agents. Coordination skills among agents are
learned by using joint actions at the highest level(s) of the hierarchy.

The effectiveness of the proposed algorithms was tested using two experimental testbeds:
a simulated robot trash collection domain, and a much larger real-world multi-agent
autonomous guided vehicle (MAGV) domain. The proposed algorithms performed well in both
domains, and in particular, in the MAGV domain, we showed that our proposed extensions
outperform widely used industrial heuristics, such as "first come first serve", "highest queue
first" and "nearest station first".

We briefly outline a number of directions for future work. In the continuous-time and
average-reward extensions of MAXQ, an immediate question is whether the algorithms
converge to recursively optimal policies. Such convergence results would provide some
theoretical validity to the proposed methods, in addition to their empirical effectiveness
demonstrated in this paper. The multiagent extension to MAXQ
is  more  difficult  to  analyze  theoretically,  due  to  the  inherent  complexity  of  the  multiagent 
setting.  However,  a  number  of  empirical  extensions  would  be  useful,  from  modeling  the 
cost  of  communication  among  agents,  to  studying  the  scenario  where  agents  begin  with 
dissimilar  task  hierarchies.  Finally,  although  our  work  primarily  focused  on  the  MAXQ 
framework,  the  key  ideas  underlying  our  proposed  methods  could  be  equally  well  applied 
to  other  HRL  frameworks,  such  as  options  and  HAMs. 